Overview. A significant hurdle confronts the software reuser attempting to
select candidate components from a software repository - discriminating between
those components without resorting to inspection of the implementation(s). We 
outline an approach to this problem based upon neural networks which avoids 
requiring the repository administrators to define a conceptual closeness graph for 
the classification vocabulary. 


1 Introduction 

Reuse has long been an accepted principle in many scientific disciplines. Biologists 
use established laboratory instruments to record experimental results; chemists use 
standardized measuring devices. Engineers design based upon the availability of 
components that facilitate product development. It is unreasonable to expect an 
electrical engineer to design and develop the transistor from first principles every 
time one is required. 

Software engineers, however, are frequently guilty of a comparable practice 
in their discipline. The reasons for this are as varied as the environments in which 
software is developed, but they usually include the following: 

*To appear in Neural Networks and Pattern Recognition in Human Computer Interfaces, R. 
Beale and J. Findlay (eds.), Ellis Horwood Ltd., West Sussex, UK, due out March, 1992. 
• a lack of development standards;

• the not invented here syndrome; 

• poor programming language support for the mechanical act of reuse; and 

• poor support in identifying, cataloging, and retrieving reuse candidates. 

The first three items involve organization mentality, and will not be
addressed here.¹ We instead focus upon the final item in this list, the nature of the
repository itself, and more specifically upon the mechanisms provided for
classification and retrieval of components from the repository.

The complexity of non-trivial software components and their supporting 
documentation easily qualifies reuse as a “wicked” problem - frequently intractable 
in both description and solution. We describe an approach that we are currently
exploring for making classification and retrieval mechanisms more efficient and 
natural for the software reuser. This approach centers around the use of neural 
networks in support of imprecise classification and querying. 


2 The Problem 

A mature software repository can contain thousands of components, each with 
its own specification, interface, and typically, its own vocabulary. Consider the 
signatures presented in Figures 1 and 2 for a stack of integers and a queue of 
integers, respectively. 


Create: => Stack
Push: Stack x Integer => Stack
Pop: Stack => Stack
Top: Stack => Integer
Empty: Stack => Boolean


Figure 1: Signature of a Stack 


¹ Concerning language support - there are languages which readily support reuse, but they
must be available to the programmers. Consider for a moment the inertia exhibited by
FORTRAN and COBOL in commercial data processing. The very existence of such large bodies
of code in languages ill-suited for reuse acts as an inhibitor for the movement of organizations
towards better suited languages.
Create: => Queue
Enqueue: Queue x Integer => Queue
Dequeue: Queue => Queue
Front: Queue => Integer
Empty: Queue => Boolean


Figure 2: Signature of a Queue

These signatures are isomorphic up to renaming, and thus exemplify what 
we have come to refer to as the vocabulary problem. Software reusers implicitly 
associate distinct semantics with particular names, for example, pop and enqueue. 
Thus, by the choice of names, a component developer can mislead reusers as 
to the semantics of components, or provide no means of discriminating between 
components. Figure 3, for example, appears to be equally applicable as a signature 
for both stack and queue, primarily due to the neutral nature of the names used. 

Create: => Sequence
Insert: Sequence x Integer => Sequence
Remove: Sequence => Sequence
Current: Sequence => Integer
Empty: Sequence => Boolean


Figure 3: Signature of a Sequence 
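
The claim that these signatures are isomorphic up to renaming can be checked mechanically. The following Python sketch is an illustration only: the dictionary encoding of a signature and the function names shape and isomorphic_up_to_renaming are assumptions introduced here, not part of the system described in this paper. It erases operation names, replaces the carrier sort with a placeholder, and compares the resulting multisets of operation profiles.

```python
# Each signature maps an operation name to (argument sorts, result sort).
# This encoding is an assumption made for the sake of the example.
STACK = {
    "Create": ((), "Stack"),
    "Push": (("Stack", "Integer"), "Stack"),
    "Pop": (("Stack",), "Stack"),
    "Top": (("Stack",), "Integer"),
    "Empty": (("Stack",), "Boolean"),
}
QUEUE = {
    "Create": ((), "Queue"),
    "Enqueue": (("Queue", "Integer"), "Queue"),
    "Dequeue": (("Queue",), "Queue"),
    "Front": (("Queue",), "Integer"),
    "Empty": (("Queue",), "Boolean"),
}
SEQUENCE = {
    "Create": ((), "Sequence"),
    "Insert": (("Sequence", "Integer"), "Sequence"),
    "Remove": (("Sequence",), "Sequence"),
    "Current": (("Sequence",), "Integer"),
    "Empty": (("Sequence",), "Boolean"),
}


def shape(signature, carrier):
    """Return the sorted operation profiles with names erased and the
    carrier sort replaced by the placeholder "T"."""
    def norm(sort):
        return "T" if sort == carrier else sort

    return sorted(
        (tuple(norm(arg) for arg in args), norm(result))
        for args, result in signature.values()
    )


def isomorphic_up_to_renaming(sig_a, carrier_a, sig_b, carrier_b):
    """True when a renaming of the carrier sort and of the operations
    maps one signature onto the other."""
    return shape(sig_a, carrier_a) == shape(sig_b, carrier_b)
```

All three signatures have the same shape under this test, which is exactly why the names alone must carry the semantic distinction.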


3 Software Classification 

Retrieval mechanisms for software repositories have traditionally provided some 
sort of classification structure in support of user queries. Keyword-based retrieval 
is perhaps the most common of these classification structures, but keywords are 
ill-suited to domains with rich structure and complex semantics. This section lays 
out the principal representational problems in software classification and selected
solutions to them. 
3.1 Literary Warrant

Library scientists use literary warrant for the classification of texts. Representative 
samples drawn from the set of works generate a set of descriptive terms, which 
in turn generate a classification of the works as a whole. The adequacy of the 
classification system hinges a great deal on the initial choice of samples. 

With appropriate tools, literary warrant in software need not restrict itself 
to a sample of the body of works. Rather, it can examine each of the individual 
works in turn, providing vocabularies for each of them. This may indeed be 
required in repositories where the component coverage in a particular area is sparse. 


3.2 Conceptual Closeness 


The vocabulary of terms built up through literary warrant typically contains a
great deal of semantic overlap: words whose meanings are the same, or at least
similar. For instance, two components, one implementing a stack and the other
a queue, might both be characterized with the word insert, corresponding to push
and enqueue, respectively, as discussed in section 2.

Synonym ambiguity is commonly resolved through the construction of a
restricted vocabulary, tightly controlled by the repository administrators.
Repository users must learn this restricted vocabulary, or rely upon the assistance of
consultants already familiar with it. It is rarely the case, however, that the choice
is between two synonyms. More typically it is between words which have similar,
but distinct, meanings (e.g., insert, push, and enqueue, as above).


3.3 Algebraic Specification 

While not really a classification technique, algebraic specification techniques (e.g., 
[GH78]) partially (and unintentionally) overcome the vocabulary problem through 
inclusion of behavioral axioms into the specification. The main objection to the use 
of algebraic specifications in reuse is the need to actually write and comprehend 
the specifications. The traditional examples in the literature rarely exceed the
complexity of the figures above. Also, algebraic techniques poorly address issues
such as performance and concurrency. 

A repository containing algebraic specifications depends upon the expertise 
of the reusers browsing the repository; small repositories are easily understood 
whereas it is unreasonable to require a reuser to examine all components in a 
large repository for suitability. 
3.4 Basic Faceted Classification


Basic faceted classification begins by using domain analysis (also known as
literary warrant) “to derive faceted classification schemes of domain specific
objects.” The classifier not only derives terms for grouping, but also identifies a
vocabulary that serves as the values that populate those groups. From the software
perspective, the groupings, or facets, become a taxonomy for the software.

Prieto-Diaz and Freeman identified six facets: function, object, medium, 
system type, functional area, and setting [PDF87]. Each software component in 
the repository has a value assigned for each of these facets. The software reuser 
locates software components by specifying facet values that are descriptive of 
the software desired. In the event that a given user query has no matches in 
the repository, the query may be relaxed by wild-carding particular facets in the 
query, thereby generalizing it. 
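
Retrieval by facet values, with relaxation by wildcarding, can be sketched as follows. This is a minimal illustration under stated assumptions: the six facet names come from [PDF87], but the facet values, the component names, and the helper functions matches, search, and relax are hypothetical and introduced here only for the example.

```python
# The six facets identified by Prieto-Diaz and Freeman [PDF87].
FACETS = ("function", "object", "medium", "system_type",
          "functional_area", "setting")
WILDCARD = "*"

# Hypothetical repository: every component has one value per facet.
REPOSITORY = {
    "int_stack": ("insert", "stack", "array", "compiler", "utilities", "library"),
    "int_queue": ("insert", "queue", "list", "scheduler", "utilities", "library"),
}


def matches(query, classification):
    """A query tuple matches when every non-wildcard facet value is equal."""
    return all(q == WILDCARD or q == c for q, c in zip(query, classification))


def search(query, repository=REPOSITORY):
    """Return the names of all components whose classification matches."""
    return [name for name, facets in repository.items() if matches(query, facets)]


def relax(query, facet):
    """Generalize a query by wildcarding one named facet."""
    i = FACETS.index(facet)
    return query[:i] + (WILDCARD,) + query[i + 1:]


# A query that is slightly wrong about the medium finds nothing...
query = ("insert", "stack", "list", "compiler", "utilities", "library")
print(search(query))                   # no exact match
# ...until the offending facet is wildcarded.
print(search(relax(query, "medium")))  # the stack component is found
```

Note that the reuser must guess which facet to relax; nothing in the tuple space itself indicates that "list" and "array" are closer to each other than to any other medium.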

The primary drawback in this approach is the flatness and homogeneity 
of the classification structure. A general-purpose reuse system might contain not 
only reusable components, but also design documents, formal specifications, and 
perhaps vendor product information. Basic faceted classification creates a single 
tuple space for all entries, resulting in numerous facets, tuples with many “not 
applicable” entries for those facets, and frequent wildcarding in user queries. 

A number of reuse repository projects have incorporated faceted
classification as a retrieval mechanism (e.g., [Gue87][Atk]), but they primarily address
the vocabulary problem through a keyword control board, charged with creating
a controlled vocabulary for classification.

Gagliano et al. computed conceptual closeness measures to define a
semantic distance between two facet values [GOF+88]. The two principal
limitations to this approach are the static nature of the distance metrics and the lack
of inter-facet dependencies; each of the facets had its own closeness matrix.
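
To make the idea of a per-facet closeness matrix concrete, the sketch below scores a component against a query by averaging per-facet closeness values. It is an assumption-laden illustration, not the measure of [GOF+88]: the numeric values, the averaging rule, and the names CLOSENESS, closeness, and score are all invented for this example.

```python
# Hypothetical, hand-assigned closeness values in [0, 1], one table per facet.
# Each table is independent of the others, so no inter-facet dependency
# can be expressed.
CLOSENESS = {
    "function": {
        ("insert", "push"): 0.9,
        ("insert", "enqueue"): 0.9,
        ("push", "enqueue"): 0.4,
    },
    "object": {
        ("stack", "queue"): 0.5,
    },
}


def closeness(facet, a, b):
    """Symmetric lookup; identical values are maximally close, and
    unlisted pairs are treated as unrelated."""
    if a == b:
        return 1.0
    table = CLOSENESS.get(facet, {})
    return table.get((a, b), table.get((b, a), 0.0))


def score(query, component):
    """Average the per-facet closeness over the facets named in the query."""
    values = [closeness(f, query[f], component[f]) for f in query]
    return sum(values) / len(values)


query = {"function": "push", "object": "stack"}
generic_stack = {"function": "insert", "object": "stack"}
int_queue = {"function": "enqueue", "object": "queue"}
print(score(query, generic_stack), score(query, int_queue))
```

Every entry in these tables must be chosen and maintained by hand, which is the static quality the text objects to.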


3.5 Lattice-Based Faceted Classification 

Eichmann and Atkins extended basic faceted classification by incorporating a
lattice as the principal structuring mechanism in the classification scheme [EA90].
As shown in Figure 4, there are two major sublattices making up the overall
lattice.
Figure 4: The Type Lattice

On the left is the sublattice comprised of sets of facet values (for clarity, 
shown here with only three facets), partially ordered by the subset relation. The 
Facets vertex in the lattice represents the empty facet set, while the Facet vertex 
represents the set of all facet values in the classification scheme. Each member of 
the power set of all facet values falls somewhere within this sublattice. 

On the right is the tuple sublattice, containing facet set components, and 
partially ordered by the subtype relation [Eic89]. The vertex denotes the empty 
tuple. The tuple vertex denotes the tuple containing all possible facet components, 
with each component containing all the values for that facet. Adding facet values 
to a component or adding a new component to a tuple instance moves the tuple 
instance down through the lattice. 

Queries to a repository supporting lattice-based faceted classification are 
similar to those to one supporting basic faceted classification, with two important 
distinctions - query tuples can mention as many or as few facets as the reuser 
wishes, thereby avoiding the need for wildcarding, and classifiers can similarly 
classify a given component with as many or as few facets as are needed for precise 
characterization of the component. 
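
The two distinctions can be illustrated with a small sketch in which both classifications and queries are partial tuples of facet value sets. This is a deliberate simplification of the subtype ordering of [EA90], reduced here to a per-facet subset test; the component data and the names satisfies, find, and widen are assumptions made for the example.

```python
# Hypothetical classifications: each component names only the facets it
# needs, and each facet holds a set of values rather than a single value.
COMPONENTS = {
    "int_stack": {
        "function": frozenset({"insert", "push"}),
        "object": frozenset({"stack"}),
    },
    "int_queue": {
        "function": frozenset({"insert", "enqueue"}),
        "object": frozenset({"queue"}),
        "medium": frozenset({"list"}),
    },
}


def satisfies(component, query):
    """A component satisfies a query when, for every facet the query
    mentions, the query's values are a subset of the component's values."""
    return all(
        frozenset(values) <= component.get(facet, frozenset())
        for facet, values in query.items()
    )


def find(query, components=COMPONENTS):
    """Return the sorted names of all components satisfying the query."""
    return sorted(name for name, c in components.items() if satisfies(c, query))


def widen(query, facet):
    """Generalize a failed query by dropping one facet entirely."""
    return {f: v for f, v in query.items() if f != facet}


print(find({"function": {"insert"}}))               # both components
failed = {"function": {"push"}, "medium": {"list"}}
print(find(failed), find(widen(failed, "medium")))  # nothing, then the stack
```

No wildcard value is needed: a facet that is not mentioned simply places no constraint on the match.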

Lattice-based faceted classification avoids conceptual closeness issues through 
the specification of sets of facet values in the classification of components. If there 
are a number of semantically close facet values that all characterize the
component, all are included in the facet instance for that component. This avoids the
need to generate closeness metrics for facet values, but it also may result in reuser 
confusion about just what the component does. 
3.6 Towards Adaptive Classification and Retrieval

The principal failing in the methods described so far is the static nature of the
classification. Once a component has been classified, it remains unchanged until
the repository administrators see fit to change it. This is unlikely to occur unless
those same administrators closely track reuser retrieval success, and more
importantly, retrieval failure - particularly in those cases where there are components
in the repository matching reuser requirements, but those components were not
identified during the query session.

Manual adjustment of closeness metrics becomes increasingly unreasonable
as the scale of the repository increases. The number of connections in the
conceptual graph is combinatorially explosive. The principal design goal in our work
is the creation of an adaptive query mechanism - one capable of altering its
behavior based upon implicit user feedback. This feedback appears in two guises:
failed queries, addressed by widening the scope of the query; and reuser refusals,
cases where candidate components were presented to the reuser, but not selected
for retrieval. The lattice provides a nice structure for the former, but a different
approach is required for the latter.


4 Our Approach 

We are currently designing a new retrieval mechanism using previous work
described in [EA90] as a starting point, and employing neural networks to address
the vocabulary and refusal problems. The motivations behind using neural
networks include:

• Associative Retrieval from Noisy and Incomplete Cues: Traditional
methods for component retrieval are based on strict pattern matching
methods such as unification. In other words, the query should contain exact
information about the component(s) in the repository. Since exact information
about components is usually not known, queries fail in cases where exact
matching does not occur. Associative retrieval based on neural networks
uses relaxation, retrieving components based on partial/approximate/best
matches. This is sometimes referred to as data fault tolerance and is ideally
suited for our problem domain.

• Classification and Optimization by Adaptation: In approaches using
the conceptual closeness measure, the problem of defining correlations
between various components and assigning a numerical correlation value rests
upon the designer or the administrator of the repository. Designers
idiosyncratically arrive at these correlations and their values, which may not be
appropriate from the perspective of the software retriever/reuser. It is our
belief that the best way to arrive at these correlations and their values is for
the system to learn them in responding to user queries.

We also intend to use another adaptation strategy for optimizing the
retrieval of similar repetitive queries. Since in most situations, reusers
repeatedly issue similar queries, the system will adapt to these queries by weight
adjustment. The weight adjustment will settle the relaxation process quickly
in response to these repetitive queries and hence result in faster retrieval.
The effect here is similar to that of caching frequently issued queries. Note,
however, that once the system has learned that two concepts are
conceptually close, we want it to remember this, irrespective of how often the reusers
inquire about it.

• Massive Parallelism: The neurocomputing paradigm is characterized by
asynchronous, massively parallel, simple computations. Since neural
networks are massively parallel, retrieval from large repositories is possible,
using the fast associative search techniques that are natural and inherent in
these networks.


5 System Architecture 

In this section, we describe some of the potential neural-network architectures and 
discuss their strengths and limitations in employing them for our task. 

5.1 Hopfield Networks 

These networks can be used as content-addressable or associative memories.
Initially the weights in the network are set using representative samples from all
the exemplar classes. After this initialization, the input pattern I is presented to
the network. The network then iterates and converges to an output. This output
represents the exemplar class which matches the input pattern best.
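
The recall behavior just described can be sketched with a small Hopfield-style memory. The code below is a textbook-style illustration under stated assumptions (bipolar +1/-1 patterns, Hebbian outer-product weights with a zero diagonal, asynchronous updates); the exemplar patterns and the function names train and recall are invented for this example and do not encode real repository components.

```python
def train(patterns):
    """Set the weights from the exemplars: sum of outer products,
    with no self-connections (zero diagonal)."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w


def recall(w, cue, max_sweeps=10):
    """Update units one at a time until a full sweep changes nothing."""
    state = list(cue)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            total = sum(w[i][j] * state[j] for j in range(n))
            new = 1 if total >= 0 else -1
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            break
    return state


# Two hypothetical, mutually orthogonal exemplar patterns.
A = [1, 1, 1, 1, -1, -1, -1, -1]
B = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train([A, B])

noisy_a = [-1, 1, 1, 1, -1, -1, -1, -1]   # exemplar A with its first bit flipped
print(recall(weights, noisy_a) == A)      # the noisy cue settles on A
```

With only two orthogonal exemplars the memory is well within capacity; the limitations listed next concern what happens as exemplars multiply and overlap.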

Although this network has many properties that are desirable for our
system, some of the serious limitations in our context include:

1. The networks have limited capacity [Lip87] and may converge to novel
spurious patterns.

2. They result in unstable exemplar patterns if many bits are shared among
multiple exemplar patterns.

3. There are no algorithms to incrementally train these networks, i.e., to adjust 
the initial weights in a manner that creates a specific alteration in subsequent 
query responses. This is important for our application, since we seek an 
architecture capable of adapting over time to user feedback. 

5.2 Supervised Learning Algorithms 

Many good supervised learning algorithms exist, including backpropagation [RHW86],
cascade correlation and others, but they cannot be used in this context because
our problem requires an unsupervised learning algorithm. Hence, we are
investigating unsupervised learning architectures, such as Adaptive Resonance Theory
(ART) [Gro88].

5.3 ART 

ART belongs to a class of learning architectures known as competitive learning
models [Gro88][CG88]. The competitive learning models are usually characterized
by a network consisting of two layers L1 and L2. The input pattern I is fed into
layer L1 where it is normalized. The normalized input is fed forward to layer L2
through the weighted interconnection links that form an adaptive filter. Layer
L2 is organized as a winner-take-all network [FB82][Sri91][BSD90]. The network
layer L2 is usually organized as a mutually inhibitory network wherein each unit in
the network inhibits every other unit in the network through a value proportional
to the strength of its activation. Layer L2 has the task of selecting the network
node a_max receiving the maximum total input from L1. The node a_max is said to
cluster or code the input pattern I.
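
The feed-forward pass of such a two-layer competitive network can be sketched as follows. Note the stated simplification: the mutually inhibitory dynamics of L2 are replaced by a direct argmax over the total inputs, which yields the same winner but does not model the settling process. The weight values and the names normalize and winner are assumptions made for the example.

```python
import math


def normalize(pattern):
    """Layer L1: scale the input pattern to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in pattern))
    return [x / norm for x in pattern] if norm else list(pattern)


def winner(weights, pattern):
    """Layer L2: compute each node's total weighted input and select the
    node receiving the maximum (argmax stands in for mutual inhibition)."""
    x = normalize(pattern)
    totals = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    best = max(range(len(totals)), key=totals.__getitem__)
    return best, totals


# Hypothetical adaptive filter: one row of weights per L2 node.
WEIGHTS = [
    [1.0, 0.0, 0.0],   # node 0 is tuned to the first input feature
    [0.0, 1.0, 1.0],   # node 1 is tuned to the second and third features
]

node, totals = winner(WEIGHTS, [0.0, 2.0, 2.0])
print(node, totals)
```

Because the input is normalized first, the winner depends on the direction of the pattern and not on its magnitude.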

In the ART system the input pattern I is fed in to the lower layer L1. This
input is normalized and is fed forward to layer L2. This results in a network node
n_max of layer L2 being selected by virtue of it having the maximum activation
value among all the nodes in the layer. This node n_max represents the hypothesis
H put forth by the network about the particular classification of the input I. Now
a matching phase occurs wherein the hypothesis H and the input I are matched,
with the quality of the required match controlled by the vigilance parameter.

If the quality of match is worse than the value specified in the vigilance
parameter, a mismatch occurs and the layer L2 is reset, thereby deactivating node
n_max. The input I activates another node and the above process recurs, comparing
another hypothesis or forming a new hypothesis about the input pattern I. New
hypotheses are formed by learning new classes and recruiting new uncommitted
nodes to represent these classes.
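
The hypothesize, match, and reset cycle can be sketched in a heavily simplified, ART1-flavored form. This is not a faithful ART implementation: patterns are plain sets of active features, the choice function and the intersection learning rule follow the usual fast-learning presentation of ART1, and the name art_present and the parameter beta are assumptions introduced for this example.

```python
def art_present(prototypes, pattern, vigilance, beta=0.5):
    """Present one non-empty pattern; return the index of the category
    (L2 node) that ends up coding it. `prototypes` is updated in place."""
    pattern = frozenset(pattern)
    if not pattern:
        raise ValueError("pattern must contain at least one active feature")

    # Hypothesis order: strongest bottom-up activation first.
    order = sorted(
        range(len(prototypes)),
        key=lambda j: len(pattern & prototypes[j]) / (beta + len(prototypes[j])),
        reverse=True,
    )
    for j in order:
        # Matching phase: how much of the input does hypothesis j explain?
        match = len(pattern & prototypes[j]) / len(pattern)
        if match >= vigilance:
            prototypes[j] = pattern & prototypes[j]   # resonance: refine prototype
            return j
        # Mismatch: node j is reset and the next hypothesis is tried.

    # Every hypothesis was rejected: recruit an uncommitted node.
    prototypes.append(pattern)
    return len(prototypes) - 1


prototypes = []
first = art_present(prototypes, {"insert", "remove", "stack"}, vigilance=0.8)
second = art_present(prototypes, {"insert", "remove", "queue"}, vigilance=0.8)
again = art_present(prototypes, {"insert", "remove", "stack"}, vigilance=0.8)
print(first, second, again)
```

At vigilance 0.8 the queue pattern shares only two of its three features with the stack prototype, so the stack node is reset and a new node is recruited; a lower vigilance would have merged the two.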

Some of the properties of ART that make it a potential choice for our
task include:

1. Real-time (on-line) learning; 

2. Unsupervised learning; 

3. Fast adaptive search for best match as opposed to strict match; and 

4. Variable error criterion which can be fine-tuned by appropriately setting the 
vigilance parameter. 

However, one of the limitations of ART for our particular task arises from 
its inability to distinguish the queries for particular components by users, from the 
component classes which form the exemplar classes. Another limitation arises from 
the fact that only one exemplar class is chosen at a time which represents the best 
match, rather than choosing a collection of close matches for reuser consideration. 

Our proposed system will operate in two phases. The first, loading phase
populates the repository with components. The second, retrieval phase
identifies candidate components in response to user queries. The distinguishing factor
between the two phases is the value of the vigilance parameter. In the loading
phase, the system will employ a high vigilance value. This ensures the
formation of separate categories for each of the components in the repository. In the
retrieval phase, the system will employ a low vigilance value, thereby retrieving
components that best match the query.
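
The effect of the two vigilance settings can be illustrated with a small sketch. It models only the vigilance threshold, not the full ART dynamics, and everything in it is an assumption made for illustration: the feature sets, the default thresholds of 0.9 and 0.3, and the names match, load, and retrieve.

```python
def match(query, prototype):
    """Fraction of the query's features that the prototype also has."""
    return len(query & prototype) / len(query) if query else 0.0


def load(categories, component, vigilance=0.9):
    """Loading phase (high vigilance): a component joins an existing
    category only on a near-exact match in both directions; otherwise
    it receives a category of its own."""
    features = frozenset(component)
    for j, proto in enumerate(categories):
        if match(features, proto) >= vigilance and match(proto, features) >= vigilance:
            return j
    categories.append(features)
    return len(categories) - 1


def retrieve(categories, query, vigilance=0.3):
    """Retrieval phase (low vigilance): return the best-matching category
    among those passing the threshold, or None if none pass."""
    query = frozenset(query)
    passing = [(match(query, p), j) for j, p in enumerate(categories)]
    passing = [(m, j) for m, j in passing if m >= vigilance]
    return max(passing)[1] if passing else None


categories = []
stack_id = load(categories, {"insert", "remove", "lifo", "integer"})
queue_id = load(categories, {"insert", "remove", "fifo", "integer"})

# An imprecise query still finds the stack at low vigilance,
# but would be rejected under the strict loading threshold.
imprecise = {"insert", "lifo", "string"}
print(retrieve(categories, imprecise), retrieve(categories, imprecise, vigilance=0.9))
```

The same matching rule thus serves both phases; only the threshold changes.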

We also intend to modify the winner-take-all network layer of the ART to
choose k winners instead of one. This is extremely useful in our context because
there may be multiple software components which meet the user specifications.
The software reuser may select a subset of m < k of these components based upon
requirements. The system should associate these m components with the user
query and retrieve them for subsequent queries having similar input specifications.
This can be achieved by associating small initial weights on the lateral links of the
winner-take-all network and modifying them appropriately based on user feedback
(i.e., reuser refusals).
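
One possible reading of this k-winner scheme with feedback is sketched below. It is a loose illustration, not the proposed design: a single additive bias per component stands in for the lateral link weights, the update is a fixed-step reward or penalty, and the activation values and the names k_winners, rank, and apply_feedback are assumptions introduced here.

```python
import heapq


def k_winners(activations, k):
    """Return the k most strongly activated components, strongest first."""
    return heapq.nlargest(k, activations, key=activations.get)


def rank(activations, lateral, k):
    """Combine bottom-up activation with the learned bias, then pick k."""
    effective = {name: a + lateral.get(name, 0.0) for name, a in activations.items()}
    return k_winners(effective, k)


def apply_feedback(lateral, presented, selected, rate=0.1):
    """Reward the components the reuser selected and penalize the
    refusals (presented but not selected)."""
    for name in presented:
        delta = rate if name in selected else -rate
        lateral[name] = lateral.get(name, 0.0) + delta


# Hypothetical activations produced by one query.
activations = {"stack": 0.9, "queue": 0.8, "deque": 0.7, "tree": 0.1}
lateral = {}

presented = rank(activations, lateral, k=3)
# The reuser twice selects only the deque and refuses the other candidates.
for _ in range(2):
    apply_feedback(lateral, presented, selected={"deque"})

print(presented, rank(activations, lateral, k=2))
```

After two refusals the deque overtakes the initially stronger candidates for this query, which is the kind of drift that user feedback is meant to produce.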
6 Discussion


6.1 Our Placement in the User-Based Framework 

Discussions in the workshop placed our work in the region of user intention / no 
feedback in the user-based framework. Upon further reflection, we have slightly 
altered our perspective. While this placement is certainly proper in the strict 
context of a single user query, it is not accurate in the broader context of a 
community of users accessing the repository over time. 

As the system is rewarded for providing true hits to users and punished for
providing false hits, there is a consensual drift, providing feedback for subsequent
user queries. Thus, viewing the amortized effect of user behavior, rather than the
immediate effect of user behavior, our system shifts down towards passive
observation and left towards immediate feedback.² The net result is that our system
occupies two distinct points in the framework, one for the semantics involved in
the immediate query and one for the semantics involved in the aggregate
behavior of the repository over time.


6.2 The Relationship to Gestural Recognition 

Beale [BE], Rubine [Rub], and Zhao [Zha], the other occupants of the Novel Input
category of the task-based framework, respectively address sign language
recognition, drawing geometric figures, and diagram editing - all interpreting imprecise
human gestures and mapping them to a precise application domain. They all
address the inability of humans to accurately repeat physical movement.

Our mechanism, on the other hand, accepts a precisely phrased user query 
and adapts it to an imprecise application domain. Ignoring the issue of poor 
typing skills, our user community can accurately repeat a given user intention 
(query) any number of times, and we know exactly what that intention is. The 
challenge in our domain occurs when that intention has no exact match in the 
system. It’s similar to Rubine’s system offering to draw a square or a hexagram 
(or perhaps even a five-sided star) when the user gestured a pentagram, but the 
system had no training in pentagram gestures. 

² Or more precisely, non-immediate feedback.
6.3 Directions for Future Research

Options available to us at this point in our work lie in two general directions:
further extending repository semantics and exploring the application of neural
networks to these types of application domains.

With respect to the former, the classification scheme described here is 
restricted to facets and tuples containing facets. In other work, the classification 
scheme was first extended to include signatures for abstract data types [Eic91a] 
and then further extended to support axioms in a second phase in the query 
process [Eic91b]. A merger of that work with that described here has appeal — 
particularly the imprecise matching of signatures. 

With respect to the latter, we are interested in studying the tradeoffs
between individual user adaptation versus the consensual adaptation described
above. These two actually are the extremes in a continuum of user groupings.
This coupled with an additional dimension of user expertise forms a state space of
user behavior where the system might more heavily weight certain semantic
connections for experts and other semantic connections for novices. This will require
the development of new algorithms for relaxation.


7 Conclusions

Our approach extends previous work in component retrieval by incrementally 
adapting the conceptual closeness weights based upon actual use, rather than an 
administrator’s assumptions. Neural networks provide a quite suitable framework 
for supporting this adaptation. Reuse repository retrieval provides a unique and 
challenging application domain for neural networking techniques. 

This approach effectively adds an additional dimension to the conceptual
space formed by the type lattice. This additional dimension allows traversal from
one vertex to another using the adapted closeness weights derived from user
activity, rather than the partial orders used in defining the lattice. The resulting
retrieval mechanism supports both well-defined lattice-constrained queries and
ill-defined neural-network constrained queries in the same framework.